On the optimality of Allen and Kennedy's algorithm for parallelism extraction in nested loops
Author
Abstract
We explore the link between dependence abstractions and maximal parallelism extraction in nested loops. Our goal is to find, for each dependence abstraction, the minimal transformations needed for maximal parallelism extraction. The result of this paper is that Allen and Kennedy's algorithm is optimal when dependences are approximated by dependence levels. This means that even the most sophisticated algorithm cannot detect more parallelism than found by Allen and Kennedy's algorithm, as long as dependence level is the only information available.

1 Introduction

Many automatic loop parallelization techniques have been introduced over the last 30 years, starting from the early work of Karp, Miller and Winograd [12] in 1967, who studied the structure of computations in repetitive codes called systems of uniform recurrence equations. This work defined the foundation of today's loop compilation techniques. It has been widely exploited and extended in the systolic array community, as well as in the compiler-parallelizer community: Lamport [14] proposed a parallel scheme, the hyperplane method, in 1974; then several loop transformations were introduced (loop distribution/fusion, loop skewing, loop reversal, loop interchange, ...) for vectorizing computations, maximizing parallelism, maximizing locality, and/or minimizing synchronizations. These techniques have been used as basic tools for optimizing algorithms, the two most famous certainly being Allen and Kennedy's algorithm [1], designed at Rice in the Fortran D compiler, and Wolf and Lam's algorithm [18], designed at Stanford in the SUIF compiler. At the same time, dependence analysis has been developed so as to provide sufficient information for checking the legality of these loop transformations, in the sense that they do not change the final result of the program.
Different abstractions of dependences have been defined (among others dependence distance [16], dependence level [1], dependence direction vector [19], dependence polyhedron or cone [11], ...), and more and more accurate tests for dependence analysis have been designed (among others Banerjee's tests [2], the I test [13], the Δ test [9], the λ test [15], the PIP test [7], the PIPS test [10], the Omega test [17], ...).

* Supported by the CNRS-INRIA project ReMaP.
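Allen and Kennedy's algorithm, mentioned above, distributes a loop nest around the strongly connected components of the dependence graph, level by level: a component with a dependence carried at the current level keeps a sequential loop, while the others can be marked parallel. A minimal sketch of this scheme follows; the statement names, the `(source, target, level)` dependence triples, and the driver are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch of Allen and Kennedy's loop distribution scheme.
# Statement names, the (source, target, level) dependence triples, and
# the depth+1 encoding of loop-independent dependences are assumptions.
from collections import defaultdict

def sccs(nodes, succ):
    """Tarjan's strongly connected components, emitted in reverse
    topological order of the condensation (sinks first)."""
    index, low, on_stack, stack, out, n = {}, {}, set(), [], [], [0]
    def visit(v):
        index[v] = low[v] = n[0]; n[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w); low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop(); on_stack.discard(w); comp.add(w)
                if w == v:
                    break
            out.append(comp)
    for v in nodes:
        if v not in index:
            visit(v)
    return out

def allen_kennedy(stmts, deps, level=1, depth=2):
    """Distribute the loop at `level` around the SCCs of the dependence
    graph restricted to dependences of level >= `level`. A dependence
    level of depth+1 encodes a loop-independent dependence. Returns a
    flat list of ('seq' | 'par', level, statements) entries."""
    if level > depth:
        return []
    sset = set(stmts)
    live = [(s, t, l) for (s, t, l) in deps
            if l >= level and s in sset and t in sset]
    succ = defaultdict(list)
    for s, t, _ in live:
        succ[s].append(t)
    result = []
    # reversing Tarjan's output yields a valid execution order of SCCs
    for comp in reversed(sccs(stmts, succ)):
        comp_stmts = [s for s in stmts if s in comp]
        carried = any(l == level for (s, t, l) in live
                      if s in comp and t in comp)
        # a dependence carried at this level forces a sequential loop;
        # otherwise the distributed loop can be marked parallel
        result.append(('seq' if carried else 'par', level, comp_stmts))
        result.extend(allen_kennedy(comp_stmts, live, level + 1, depth))
    return result
```

For a depth-2 nest in which S1 carries a self-dependence at level 1 and feeds S2 only through a loop-independent dependence, the sketch serializes just the outer loop around S1 and marks every other distributed loop parallel.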
Similar papers
On the Optimality of Allen and Kennedy's Algorithm for Parallel Extraction in Nested Loops
We explore the link between dependence abstractions and maximal parallelism extraction in nested loops. Our goal is to find, for each dependence abstraction, the minimal transformations needed for maximal parallelism extraction. The result of this paper is that Allen and Kennedy's algorithm is optimal when dependences are approximated by dependence levels. This means that even the most sophistica...
Parallelizing Nested Loops with Approximations of Distance Vectors: A Survey
In this paper, we compare three nested loops parallelization algorithms (Allen and Kennedy's algorithm, Wolf and Lam's algorithm and Darte and Vivien's algorithm) that use different representations of distance vectors as input. We study the optimality of each with respect to the dependence analysis it uses....
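The survey above compares algorithms driven by different approximations of distance vectors. The coarsest such approximation, the dependence level used by Allen and Kennedy's algorithm, is simply the 1-based index of the first nonzero component of the distance vector. A hedged helper illustrating this (the depth+1 encoding of loop-independent dependences is one common convention, assumed here):

```python
def dependence_level(distance, depth):
    """Dependence level of a distance vector: the 1-based index of its
    first nonzero component. A zero vector (a loop-independent
    dependence) is encoded as depth + 1, one common convention."""
    for i, d in enumerate(distance, start=1):
        if d != 0:
            return i
    return depth + 1
```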
Plugging Anti and Output Dependence Removal Techniques into Loop Parallelization Algorithms
In this paper we briefly survey some loop transformation techniques which break anti or output dependences, or artificial cycles involving such "false" dependences. These false dependences are removed through the introduction of temporary buffer arrays. Next we show how to plug these techniques into loop parallelization algorithms (such as Allen and Kennedy's algorithm). The goal is to extract as...
Executing Nested Parallel Loops on Shared-Memory Multiprocessors (Sadun)
Cache-coherent, bus-based shared-memory multiprocessors are a cost-effective platform for parallel processing. In scientific parallel applications, most of the computation involves processing of large multidimensional data structures which results in a high degree of data parallelism. This parallelism can be exploited in the form of nested parallel loops. Most existing shared memory multiprocesso...
Journal title:
Volume / Issue:
Pages: -
Publication date: 1996